iRefIndex is a consolidation of protein interaction databases, providing an index of canonical interaction pairs together with references to the source database(s) that provide evidence for each interaction. The purpose of this notebook is to extract a binary feature for each database integrated into iRefIndex. These databases are:

  • BIND
  • BioGRID
  • CORUM
  • DIP
  • HPRD
  • InnateDB
  • IntAct
  • MatrixDB
  • MINT
  • MPact
  • MPIDB
  • MPPI
  • OPHID

To extract this feature we iterate over the table, using each pair of Entrez Gene IDs as a key that indexes the source databases referring to that interaction.
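
The Gene IDs live in the MITAB alternative-identifier columns, where each field is a pipe-separated list of database:accession pairs. As a rough illustration (the accession values below are hypothetical), the entrezgene/locuslink entry is pulled out like this:

    # hypothetical MITAB alternative-identifier field
    field = "uniprotkb:P12345|entrezgene/locuslink:1234|rogid:abcdef"
    gids = []
    for s in field.split("|"):
        s = s.split(":")
        if s[0] == "entrezgene/locuslink":
            gids.append(s[1])
    print gids  # ['1234']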


In [1]:
cd ../../iRefIndex/


/home/gavin/Documents/MRes/iRefIndex

In [4]:
import csv

In [13]:
import pdb

In [24]:
f = open("9606.mitab.08122013.txt")
c = csv.reader(f,delimiter="\t")
irefindexdict = {}
for l in c:
    # extract Gene IDs from the alternative-identifier columns (altA, altB)
    gids = []
    for x in [l[2],l[3]]:
        for s in x.split("|"):
            s = s.split(":")
            if s[0]=="entrezgene/locuslink":
                gids.append(s[1])
    # only add an entry to the dictionary if there is a pair of Gene IDs;
    # column 13 (l[12]) is the source database reporting the interaction
    if len(gids) == 2:
        try:
            irefindexdict[frozenset(gids)] += [l[12]]
        except KeyError:
            irefindexdict[frozenset(gids)] = [l[12]]
f.close()
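
Each key of irefindexdict is a frozenset of one or two Entrez Gene ID strings (self-interactions collapse to a single element) and each value is the list of source-database strings reporting that pair. An arbitrary entry can be inspected like this:

    k = next(iter(irefindexdict))
    print k, irefindexdict[k]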

Now we find the strings corresponding to unique databases:


In [26]:
import itertools
# flatten the per-pair lists of source databases and keep the unique strings
uniqdbs = list(set(itertools.chain.from_iterable(irefindexdict.values())))
print uniqdbs


['MI:0465(dip)', 'MI:0469(intact)', 'MI:0463(biogrid)', 'MI:0468(hprd)', 'MI:0000(corum)', 'MI:0000(mppi)', 'MI:0462(bind)', 'MI:0917(matrixdb)', 'MI:0000(bind_translation)', 'MI:0000(ophid)', 'MI:0974(innatedb)']

Using these we can create a second dictionary with the same keys as above, but with a 1-of-k encoding over the databases as values:


In [27]:
ireffeaturedict = {}
for k in irefindexdict.keys():
    fvector = []
    for db in uniqdbs:
        if db in irefindexdict[k]:
            fvector.append("1")
        else:
            fvector.append("0")
    ireffeaturedict[k] = fvector
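
To check the encoding, an entry can be decoded back into the names of the databases that report it:

    k = next(iter(ireffeaturedict))
    for db, flag in zip(uniqdbs, ireffeaturedict[k]):
        if flag == "1":
            print db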

Saving the results

These results will be saved in two ways:

  • First, the results will be saved to a file using the above unique database identifiers as column labels
  • Second, the dictionary will be wrapped in a class written specifically for iRefIndex and pickled, so that it can later be loaded when building feature vectors

In [29]:
f = open("human.iRefIndex.Entrez.1ofk.txt", "w")
c = csv.writer(f,delimiter="\t")
c.writerow(["protein1","protein2"]+uniqdbs)
for k in ireffeaturedict.keys():
    pair = list(k)
    # self-interactions have only one Gene ID in the frozenset, so duplicate it
    if len(pair) == 1:
        pair = pair*2
    c.writerow(pair + ireffeaturedict[k])
f.close()

In [30]:
!head human.iRefIndex.Entrez.1ofk.txt

In [31]:
import sys

In [32]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")

In [35]:
import ocbio.irefindex

In [37]:
features = ocbio.irefindex.features(ireffeaturedict)
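
The features class lives in the opencast-bio repository and is not reproduced in this notebook; a minimal sketch of what such a wrapper might look like (the real ocbio.irefindex.features implementation may differ) is:

    class features(object):
        """Wraps the iRefIndex 1-of-k dictionary so feature vectors can be
        looked up by a pair (frozenset) of Entrez Gene ID strings."""
        def __init__(self, featuredict):
            self.featuredict = featuredict

        def __getitem__(self, key):
            # hypothetical behaviour: return the stored vector, or all zeros
            # for pairs that do not appear in iRefIndex
            try:
                return self.featuredict[frozenset(key)]
            except KeyError:
                return ["0"]*len(next(iter(self.featuredict.values())))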

In [38]:
import pickle

In [39]:
f = open("human.iRefIndex.Entrez.1ofk.pickle","wb")
pickle.dump(features,f)
f.close()
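
The pickled object can later be loaded and queried when assembling feature vectors; note that ocbio.irefindex must be importable for the class to unpickle:

    f = open("human.iRefIndex.Entrez.1ofk.pickle", "rb")
    features = pickle.load(f)
    f.close()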